1.  INTRODUCTION


Recent advances in semiconductor device fabrication technology have made possible the realization
of increasingly complex digital integrated circuits. VLSI computation theory addresses the problem
of efficiently using this chip complexity (as measured by area) in order to decrease computation
time. The search for efficient use of area and time resources has borne certain classes of architectures
that are repeatedly utilized.

One such class is that of systolic arrays. Systolic arrays have been described by H. T. Kung and
C. E. Leiserson [8]:

"A systolic system is a network of processors which rhythmically compute and pass data through the
system.... Every processor regularly pumps data in and out, each time performing some short
computation, so that a regular flow of data is maintained in the network."

Systolic networks exhibit regular and modular layouts. In addition, interprocessor connections are
bounded in number and localized in space. These features make systolic architectures particularly well
suited to the planar format imposed by VLSI technology. Systolic architectures also interface with
conventional computer memories in a natural and efficient way. A number of systolic arrays have been
proposed by a number of authors [1,2,3,4,6,7,8,10,11]. These arrays seem most promising in certain
numerical computations; proposed applications include discrete convolution, matrix multiplication, LU
and other matrix decompositions, triangular system solution, and other related computations.

The  systolic  arrays  that  have  been  proposed  thus  far  share  many  common  features.  However, 
there  has  been  little  theory  unifying  these  designs,  and  most  have  been  presented  ad  hoc,  without 
detailed  analysis.  Here,  we  take  the  first  steps  toward  the  development  of  a  theoretical  framework  to 
unify the analysis and synthesis of systolic networks. We describe a class of transformations on
systolic networks that alter the topology of a network while preserving the timing of its computations.
These  transformations  may  be  used  to  demonstrate  the  equivalence  of  two  existing  systolic  designs  or  to 
obtain  a  new  design  from  an  existing  one. 

This thesis is organized as follows: In the second chapter, we discuss our model of a systolic
network and identify the parameters that we will use to characterize one. In the third chapter, we present


2.  MODEL  OF  A  SYSTOLIC  NETWORK 

2.1.  MODEL  OF  PROCESSING  ELEMENTS 

A  systolic  network  can  be  viewed  as  a  collection  of  processing  elements  (PEs)  located  at  vertices  of 
a  multidimensional  uniform  grid.  Each  of  the  PEs  can  be  partitioned  (Figure  1)  into  a  control  machine, 
M, and a computation machine, N. Both of these can be considered finite state machines; however,
references to the state of a PE will actually apply only to the state of the control machine. The state of
the computation machine is the contents of its data registers and will be considered later.


Figure 1. General model of a systolic PE (control machine, M, and computation machine, N)


The  control  machine  effects  the  correct  state  transitions  for  the  PE,  transmits  the  appropriate  control 
signals  to  neighboring  PEs,  and  controls  the  functions  computed  by  the  computation  machine.  In  short, 
the  control  machine  performs  the  synchronization  and  sequencing  activities  necessary  to  coordinate  the 
operation of the systolic network. The computation machine, on the other hand, operates on the data
flowing  through  the  network.  It  performs  the  arithmetic  involved  in  computing  the  actual  output  of 
the  network.  Using  the  formalism  of  finite  state  machines,  we  have 

\[ s' = \delta_M(s, C_I), \quad C_O = \lambda_M(s, C_I), \quad R' = \delta_N(R, D_I, s), \quad D_O = \lambda_N(R, D_I, s), \]

where s, R, C, and D are the PE state, the PE register contents, the control signals, and the data,
respectively. Also, the subscripts I and O denote input and output, respectively, and $\delta$ and $\lambda$ are,
respectively, the state transition and output functions.

The PEs in a systolic network are selected from a set of possible module types. Some systolic
networks utilize only one type of module, while others utilize as many as four types. In all cases,
however, the number of module types is independent of the network size and is, instead, determined by the
type of computation performed by the network. Each module type is characterized by a module
description. The module description specifies the states in which a PE of this module type can be. Each
of these states, in turn, is characterized by a state description. A state description consists of a
collection of assignment statements and control statements to be executed by a PE in this state at the end of
every  clock  cycle.  The  assignment  statements  are  written  in  a  register  transfer  language  and  dictate  the 
operation  of  the  computation  machine;  they  indicate  which  input  ports  or  registers  serve  as  sources  of 
operands,  which  operations  are  performed  on  the  operands,  and  which  output  ports  or  registers  serve  as 
destinations  for  the  results.  The  control  statements  dictate  the  operation  of  the  control  machine;  they 
indicate  state  transitions  to  be  executed  by  the  PE  or  its  neighbors.  A  state  transition  to  be  executed  by 
a  neighbor  must  be  initiated  by  the  PE  via  control  signals. 

2.2.  MODEL OF DATA FLOWS

The locations of the PEs in the uniform grid are constrained in two ways. First, consider the
subset of PEs belonging to a particular module type. We will require that the set of positions occupied by
these PEs forms a lattice. A set, P, of points on a uniform grid is a lattice if
$P = \{\mathbf{p} \mid \mathbf{p} = L\mathbf{g} + \mathbf{d},\ \mathbf{g} \in G\}$, where G is the set of unit-grid points contained in a closed convex
domain (G is referred to briefly as a convex grid set), L (the distortion matrix) is a matrix of rational
numbers mapping each unit-grid point to a uniform-grid point, and d is a fixed grid point (the origin);
i.e., P must be the result of an affine transformation on a convex grid set, G (Figure 2). The notation p
is used here to represent a d-component vector $[p_1\ p_2\ \cdots\ p_d]^T$, where d is the dimension of the
uniform grid in which the PEs are located.


Figure 2. Example of a lattice in two dimensions,
the underlying convex grid set (G),
and the distortion matrix (L)


Secondly, consider a PE at position p. If this PE receives input data directly (in unit time) from
another PE at position $\mathbf{p} - \mathbf{v}$ ($\mathbf{v} \neq \mathbf{0}$), then, in order to maintain a regular flow of data, we will require
that it be able to transmit output data directly (in unit time) to a PE at position $\mathbf{p} + \mathbf{v}$ (except, possibly,
for boundary PEs). Thus, there must exist a PE at $\mathbf{p} + \mathbf{v}$, a directed edge for data communication with
terminals $(\mathbf{p} - \mathbf{v}, \mathbf{p})$, and one with terminals $(\mathbf{p}, \mathbf{p} + \mathbf{v})$.

If we associate all the data that either flow along communication edges with a particular length
and orientation or reside in a particular set of computation machine registers, this will define a data
flow. Formally, a data flow is a pair $C = \langle A, \phi \rangle$ defined as follows:

1. A is a set of data (inputs, outputs, or intermediate results of a systolic computation);

2. N, the set of natural integers, corresponds to the set of instants of discrete time ($t \in N$ is the index
   of the time unit);

3. $\phi : A \times N \to U$ is an injective function of the data set and discrete time that maps, at any given t,
   each of the elements of A to a position on the uniform grid, U. Denote by $\phi(A,t)$ the range
   of $\phi$ at time t. Function $\phi$ is further constrained as follows:

   a. $\dfrac{\phi(a, t+k) - \phi(a, t)}{k} = \mathbf{v}$ for all $a \in A$ and all $k, t \in N$
      (i.e., all elements of A move with the same constant velocity, v).

   b. $\phi(A,t)$ must be a lattice for any integer t.

Combining (3a) and (3b), we obtain $\phi(A,t) = \{\mathbf{p} \mid \mathbf{p} = L\mathbf{g} + \mathbf{d} + t\mathbf{v},\ \mathbf{g} \in G\}$. Thus, we can define
an injective, time-independent function, $\gamma$, which maps elements of A to a position in G, a given convex
grid set:

\[ \gamma : A \to G. \]

Therefore, $\phi(a,t) = L\gamma(a) + \mathbf{d} + t\mathbf{v}$.
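As a minimal sketch of the position function just derived, the following Python fragment evaluates the position of a data element as the distortion matrix applied to its grid index, plus the origin, plus t times the velocity, using exact rational arithmetic. The function names `mat_vec` and `phi` and the example flow (identity distortion, zero origin, unit eastward velocity) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def mat_vec(L, g):
    """Multiply matrix L (a list of rows) by vector g."""
    return [sum(Fraction(l) * x for l, x in zip(row, g)) for row in L]


def phi(L, d, v, g, t):
    """Position at time t of the element whose grid index is g = gamma(a):
    phi(a, t) = L*gamma(a) + d + t*v."""
    Lg = mat_vec(L, g)
    return [Lg[i] + Fraction(d[i]) + t * Fraction(v[i]) for i in range(len(d))]


# Hypothetical 2-D flow: identity distortion, origin (0, 0), velocity (1, 0).
L = [[1, 0], [0, 1]]
d = [0, 0]
v = [1, 0]
p0 = phi(L, d, v, [2, 3], 0)
p4 = phi(L, d, v, [2, 3], 4)

# Constraint (3a): the displacement over k = 4 steps divided by k equals v.
velocity = [(b - a) / 4 for a, b in zip(p0, p4)]
print(p0, p4, velocity)
```

Exact fractions are used because the model requires velocities and distortion matrices to have rational elements.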
3.  TRANSFORMATION OF DATA FLOWS

3.1.  MATHEMATICAL  DEFINITION 

In this section, we define a class of transformations on data flows that alter the topology of a
systolic network without changing the timing of its data interaction. Other authors have suggested
similar, but somewhat different transformations. In particular, Leiserson and Saxe have proven a "Systolic
Conversion Theorem" that converts nonsystolic networks to systolic networks [9]. However, their
conversion is effected by retiming the computations of a network without changing the topology of the
underlying communication graph. Cappello and Steiglitz have also suggested transformations to unify
the design of systolic networks [5]. They have described linear transformations of space-time that are
capable of altering both network topology and computation timing. As we shall explain later, the
transformations described by Cappello and Steiglitz are especially similar to ours, which are
characterized by the two following theorems.

Theorem 1:

A constant vector, u, may be added to all data flow velocities without altering the data flow
origins, the distortion matrices, $\gamma$, or the timing of the computations.

Proof: Consider two arbitrary elements of two different data flows, $x \in X$ and $y \in Y$, that
must interact at some time t. This constrains X and Y to satisfy $\phi(x,t) = \phi(y,t)$. If $L_x$, $\mathbf{d}_x$,
and $\mathbf{v}_x$ denote the distortion matrix, the origin, and the velocity of data flow X, respectively,
and $L_y$, $\mathbf{d}_y$, and $\mathbf{v}_y$ denote the corresponding entities of data flow Y, we can write this
constraint as:

\[ L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x = L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y. \]

We now show that if $\mathbf{v}_x'$ and $\mathbf{v}_y'$ are the transformed velocities of the X and Y data flows,
respectively, the above constraint is still satisfied.

Let $\mathbf{v}_x' = \mathbf{v}_x + \mathbf{u}$, $\mathbf{v}_y' = \mathbf{v}_y + \mathbf{u}$:

\[
\begin{aligned}
L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x &= L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y \\
L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x + t\mathbf{u} &= L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y + t\mathbf{u} \\
L_x\gamma_x(x) + \mathbf{d}_x + t(\mathbf{v}_x + \mathbf{u}) &= L_y\gamma_y(y) + \mathbf{d}_y + t(\mathbf{v}_y + \mathbf{u}) \\
L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x' &= L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y'
\end{aligned}
\]

Furthermore, since this transformation is invertible, no additional interactions are introduced
by it, i.e., there is a one-to-one correspondence between the data interactions in the
original network and the data interactions in the resultant network. □
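The statement of Theorem 1 can be checked numerically on a small case. The sketch below, written for a linear array (scalar parameters), enumerates every coincidence of an element of one flow with an element of another before and after a common vector is added to both velocities. The two flows, the added value u = 1/2, and the helper names `phi`, `interactions`, and `add_velocity` are hypothetical choices for illustration, not values from the text.

```python
from fractions import Fraction
from itertools import product


def phi(flow, g, t):
    """Scalar (linear-array) position: L*g + d + t*v."""
    L, d, v = flow
    return L * g + d + t * v


def interactions(fx, fy, indices, times):
    """All (gx, gy, t) at which an element of X and an element of Y coincide."""
    return {(gx, gy, t)
            for gx, gy, t in product(indices, indices, times)
            if phi(fx, gx, t) == phi(fy, gy, t)}


def add_velocity(flow, u):
    """Theorem 1: only the velocity changes; L and d are left alone."""
    L, d, v = flow
    return (L, d, v + u)


# Hypothetical flows given as (distortion, origin, velocity).
fx = (Fraction(2), Fraction(0), Fraction(-1))
fy = (Fraction(1), Fraction(0), Fraction(0))
u = Fraction(1, 2)

before = interactions(fx, fy, range(4), range(8))
after = interactions(add_velocity(fx, u), add_velocity(fy, u), range(4), range(8))
print(before == after, len(before))
```

The interaction sets are compared as sets of (index, index, time) triples, so equality means that the same pairs of elements meet at the same times, which is the sense in which the timing is preserved.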

Theorem 2:

All of the data flow velocities, origins, and distortion matrices may be multiplied by a
nonsingular matrix, M, without altering $\gamma$ or the timing of the computations.

Proof: Again, we consider two arbitrary interacting elements and show that if $\mathbf{v}_x'$ and $\mathbf{v}_y'$
are the transformed velocities, $\mathbf{d}_x'$ and $\mathbf{d}_y'$ are the transformed origins, and $L_x'$ and $L_y'$ are
the transformed distortion matrices of the X and Y data flows, respectively, the positional
constraint is still satisfied.

Let $\mathbf{v}_x' = M\mathbf{v}_x$, $\mathbf{v}_y' = M\mathbf{v}_y$, $\mathbf{d}_x' = M\mathbf{d}_x$, $\mathbf{d}_y' = M\mathbf{d}_y$, $L_x' = ML_x$, $L_y' = ML_y$:

\[
\begin{aligned}
L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x &= L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y \\
M(L_x\gamma_x(x) + \mathbf{d}_x + t\mathbf{v}_x) &= M(L_y\gamma_y(y) + \mathbf{d}_y + t\mathbf{v}_y) \\
ML_x\gamma_x(x) + M\mathbf{d}_x + tM\mathbf{v}_x &= ML_y\gamma_y(y) + M\mathbf{d}_y + tM\mathbf{v}_y \\
L_x'\gamma_x(x) + \mathbf{d}_x' + t\mathbf{v}_x' &= L_y'\gamma_y(y) + \mathbf{d}_y' + t\mathbf{v}_y'
\end{aligned}
\]

Since M is nonsingular, this transformation is also invertible, and, again, there is a
one-to-one correspondence between the data interactions in the original network and the data
interactions in the resultant network. □
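A similar numerical check can be sketched for Theorem 2 in two dimensions. The two flows, the nonsingular matrix M (determinant 2), and the helper names below are assumptions chosen for illustration; the check confirms only that the same index pairs coincide at the same times before and after every distortion matrix, origin, and velocity is multiplied by M.

```python
from fractions import Fraction
from itertools import product


def mat_vec(M, x):
    """Matrix-vector product with exact rational arithmetic."""
    return tuple(sum(Fraction(m) * xi for m, xi in zip(row, x)) for row in M)


def mat_mul(M, L):
    """Matrix-matrix product M*L."""
    cols = list(zip(*L))
    return [[sum(Fraction(m) * c for m, c in zip(row, col)) for col in cols]
            for row in M]


def phi(flow, g, t):
    """Position L*g + d + t*v of the element with grid index g at time t."""
    L, d, v = flow
    Lg = mat_vec(L, g)
    return tuple(Lg[i] + d[i] + t * v[i] for i in range(len(d)))


def transform(flow, M):
    """Theorem 2: L' = M L, d' = M d, v' = M v."""
    L, d, v = flow
    return (mat_mul(M, L), mat_vec(M, d), mat_vec(M, v))


def interactions(fx, fy, grid, times):
    """All (gx, gy, t) at which an element of X and an element of Y coincide."""
    return {(gx, gy, t) for gx, gy, t in product(grid, grid, times)
            if phi(fx, gx, t) == phi(fy, gy, t)}


# Hypothetical flows given as (distortion matrix, origin, velocity).
fx = ([[1, 0], [-1, -1]], (0, 0), (0, 1))
fy = ([[1, 0], [0, 1]], (0, 0), (0, 0))
M = [[1, 1], [0, 2]]  # nonsingular: determinant 2

grid = list(product(range(3), repeat=2))
before = interactions(fx, fy, grid, range(7))
after = interactions(transform(fx, M), transform(fy, M), grid, range(7))
print(before == after, len(before))
```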

These two theorems provide a simple, yet powerful set of rules for transforming systolic
networks. Networks that are equivalent under these transformations are said to be affinely equivalent or
simply equivalent. Networks that are equivalent under transformations of the second type alone (those
described by Theorem 2) are said to be linearly equivalent. In general, these affine transformations
result in topological changes in the network, but, if we restrict them to be linear, they result only in
"conformal" changes in the network, i.e., dilation, contraction, rotation, or reflection. The intermediate
results of such transformations are arbitrary, but the final result must represent a valid set of data
flows; in particular, each of the velocities and distortion matrices must consist of rational elements. In
subsequent portions of this thesis, we will tacitly ignore the data flow origins since these parameters can
easily be obtained after transformation through initial condition considerations.

As  was  noted  previously,  the  affine  transformation  of  data  flow  parameters  is  very  similar  to  the 
linear  transformation  of  space-time  as  described  by  Cappello  and  Steiglitz.  In  fact,  in  cases  where  both 
can  be  applied,  our  affine  transformations  are  a  special  case  of  the  Cappello-Steiglitz  transformations. 
Specifically,  Theorem  1  describes  transformations  that  Cappello  and  Steiglitz  would  represent  by  the 
matrix 

\[ \begin{bmatrix} I & \mathbf{u} \\ \mathbf{0}^T & 1 \end{bmatrix}, \]

while Theorem 2 describes transformations that they would represent by the matrix

\[ \begin{bmatrix} M & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}. \]

(The last coordinate is taken to be time.) We feel, however, that this loss of generality is compensated
by the following considerations. First, the affine transformations give the designer more of a
"kinematic" intuition of the design process and are simpler to use if one is given a systolic network a
priori. The Cappello-Steiglitz transformations, on the other hand, require the geometric description of
an algorithm. Second, the set of systolic networks is closed under affine transformation. However,
Cappello-Steiglitz transformations may yield designs that are unrealistic in the VLSI model of
computation, e.g., designs with unbounded fan-in or fan-out. (This is necessary, of course, in contexts where
such designs are to be studied.) Finally, as we shall see, affine transformations may also be used to
derive the module descriptions of a transformed network. This task becomes nontrivial when dealing
with systolic networks having multiple module types or a module type with multiple states.

3.2.  CANONICAL  REPRESENTATION 

Since affine transformations can yield a number of equivalent systolic networks, it will be useful
to distinguish a canonical network. The typical systolic computation is defined in terms of an
operation, a set of one or more operands, a result, and, possibly, a set of side conditions. For a particular
computation, a systolic network has inputs and outputs corresponding in some way to the operands and the
result. This correspondence is unrestricted, i.e., inputs need not correspond to operands, and outputs
need not correspond to the result. The systolic network computes the outputs (whether they are results
or operands) so that the result is consistent with the operation on the operands, subject to any existing
side conditions.

With  this  in  mind,  we  define  the  canonical  network  as  one  in  which  the  result  data  flow  has  zero 
velocity  and  an  identity  distortion  matrix.  All  systolic  networks  can  then  be  represented  as  a  two  step 
transformation  of  a  canonical  design:  first,  we  add  a  vector  to  all  data  flow  velocities  of  the  canonical 
design (according to Theorem 1); second, we multiply all data flow velocities and distortion matrices by
a nonsingular matrix (according to Theorem 2). This representation will be called the canonical
representation of the network. Thus, a set of networks that share the same canonical network is an
affine  equivalence  class,  and  a  set  of  networks  that  share  the  same  canonical  network  and  the  same  first 
step  of  their  canonical  representations  is  a  linear  equivalence  class. 
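For a linear array, where all parameters are scalars, the two-step canonical representation can be recovered from the result flow alone: the multiplier M equals the result flow's distortion, and the added velocity u equals the result flow's velocity divided by M. The sketch below assumes this scalar case. The function name `canonical_representation` is hypothetical, and the example network uses the velocities 0, 1/2, and -1/2 together with the distortions 1/2, -1, and 1 of the stationary-weight convolver discussed in Section 3.4.1.

```python
from fractions import Fraction as F


def canonical_representation(flows, result):
    """Scalar (linear-array) case. `flows` maps a label to (velocity, distortion).
    Returns (canonical flows, u, M) such that, for every flow,
    velocity = M * (canonical velocity + u) and distortion = M * canonical distortion."""
    v_r, L_r = flows[result]
    M = L_r          # the canonical result flow has distortion 1
    u = v_r / M      # the canonical result flow has velocity 0
    canon = {k: (v / M - u, L / M) for k, (v, L) in flows.items()}
    return canon, u, M


# Stationary-weight convolver: velocities 0, 1/2, -1/2; distortions 1/2, -1, 1.
w1 = {"w": (F(0), F(1, 2)), "x": (F(1, 2), F(-1)), "y": (F(-1, 2), F(1))}
canon, u, M = canonical_representation(w1, "y")
print(canon, u, M)
```

In this example the recovered canonical network has a stationary result flow with unit distortion, as the definition requires, and the first step of the representation is the addition of -1/2 to all velocities.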

3.3.  TRANSFORMATION OF STATE FLOWS

At  first,  one  might  suspect  that  the  module  descriptions  of  a  systolic  network  resulting  from  an 
affine  transformation  cannot  be  recovered  easily  from  the  module  descriptions  of  the  original  network. 
Conceptually,  though,  it  is  quite  simple.  Let  us  replace  each  of  the  PEs  of  the  original  array  with  an 
emulator,  E,  that,  when  given  an  input  and  a  state  chosen  from  the  union  of  all  possible  states  of  all 
module  types,  computes  the  output  generated  by  a  PE  in  the  given  state  upon  receipt  of  that  input.  The 
emulator is essentially a computation machine general enough to compute any function of any module
type, and the current state is simply one of the inputs to the emulator (Figure 3).


Figure 3. General model of a systolic emulator (with "state from" and "state to" connections)

The equations for the emulator are $R' = \delta_N(R, D_I, s)$ and $D_O = \lambda_N(R, D_I, s)$. The PE register contents (R)
and the input/output data ($D_I$ and $D_O$) are data flow elements. (Register contents are elements of data
flows  with  zero  velocity.)  Collectively,  the  states  can  be  thought  of  as  another  data  flow,  which  we 
will  refer  to  as  the  state  flow.  This  state  flow  can  be  transformed  along  with  the  other  data  flows. 
After  the  transformation,  the  peculiarities  of  the  new  state  flow  may  suggest  new  module  types  that 
may  be  used  to  replace  the  emulators  with  hard-wired  state  transition  control  machines. 

In  order  to  carry  out  a  transformation  on  the  state  flow,  it  is  first  necessary  to  have  the  module 
descriptions  in  an  appropriate  format.  First,  we  assign  a  distinct  label  to  each  state  of  every  type  of 
module.  This  collection  of  labeled  states  forms  the  state  alphabet  from  which  elements  of  the  state 
flow  are  selected.  Second,  we  modify  the  state  descriptions  by  removing  all  control  statements  and 
replacing  input/output  port  and  register  designations  in  the  assignment  statements  with  the  appropriate 
data  flow  labels.  Control  statements  can  be  eliminated  since  the  state  flow  now  provides  state  transition 
information to module emulators, while port and register relabeling is necessary since data flow
velocities are not preserved under affine transformation. For instance, data flowing from west to east in the
original network may, after a suitable affine transformation, flow from south to north in the resulting
network. Similarly, a data flow that is stationary, i.e., contained in module registers, may move from
south to north after transformation. Thus, references to register names, east and west ports, or north
and south ports become meaningless after transformation. Any ambiguity in reference to input/output
ports or registers by data flow label will be resolved in the following manner: a data flow label on the
right-hand side of an assignment arrow will refer to the port through which elements of this data flow
enter the module or, if the data flow is stationary, the register in which elements of it are contained; a
data flow label on the left-hand side of an assignment arrow will refer to the port through which
elements of this data flow exit the module or, if the data flow is stationary, the register in which elements
of it are contained.

Abstracting  the  state  flow  from  the  control  statements  executed  by  the  PEs  is  basically  a  problem 
in the theory of cellular automata. We know of no formalized or algorithmic approach to this
problem; however, most of the instances encountered with systolic processing require only a limited effort
and can be determined by inspection. The parameters to be determined are the velocity, a suitable
convex grid set, and a distortion matrix. Once these are determined, they can be transformed along with
the  parameters  of  the  data  flows  according  to  rules  of  Theorems  1  and  2. 

3.4.  EXAMPLES  OF  TRANSFORMATIONS 

3.4.1.  DISCRETE OPEN CONVOLUTION

Let us consider the problem of discrete open convolution, which is stated simply as follows: given
a sequence of weights, W, and a sequence of inputs, X, compute the result sequence, Y = W * X, where
the operation "*" is defined as $Y[i] = \sum_j W[j] \times X[i-j]$. Usually, W, X, or both have finite length, so the
summation has finite bounds.

Kung has catalogued a family of linear systolic networks for discrete convolution in [7]. He
refers to these as "(pure-) systolic" convolution arrays to distinguish them from "(semi-) systolic" arrays
in which global fan-in or fan-out is necessary. Kung, however, has used a nonstandard definition of
convolution where $Y[i] = \sum_j W[j] \times X[i+j-1]$. We prefer to adhere to the conventional definition of
convolution since it preserves the symmetry (commutativity) of the operation, i.e., W * X = X * W.
Clearly, convolution in the sense of Kung, which is conventionally referred to as "correlation," is
equivalent to convolving X delayed one time unit and W reversed in time. In other words, if Y is the
convolution of X and W in the sense of Kung, then Y = U * V, where $U[k] = W[-k]$ and $V[k] = X[k-1]$.
In this thesis, future references to the designs of Kung will incorporate the modifications necessary to
compute the conventional convolution.
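The relationship between the two definitions can be checked on a small example. In the sketch below, finite sequences are represented as dictionaries from index to value, with zero assumed outside the listed indices; the sample sequences and the function names `conv` and `kung` are hypothetical.

```python
def conv(U, V, i):
    """Conventional convolution: sum over k of U[k] * V[i - k]."""
    return sum(U[k] * V.get(i - k, 0) for k in U)


def kung(W, X, i):
    """Convolution in the sense of Kung: sum over j of W[j] * X[i + j - 1]."""
    return sum(W[j] * X.get(i + j - 1, 0) for j in W)


# Hypothetical finite sequences (index -> value); zero elsewhere.
W = {0: 2, 1: -1, 2: 3}
X = {0: 1, 1: 4, 2: 0, 3: -2}

U = {-k: w for k, w in W.items()}      # U[k] = W[-k]  (W reversed in time)
V = {k + 1: x for k, x in X.items()}   # V[k] = X[k-1] (X delayed one unit)

same = all(conv(U, V, i) == kung(W, X, i) for i in range(-5, 8))
commutes = all(conv(W, X, i) == conv(X, W, i) for i in range(-5, 8))
print(same, commutes)
```

The second check illustrates the commutativity that motivates keeping the conventional definition.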


One possible canonical convolution network is that labeled R1 (results stay, inputs and weights
move in opposite directions) by Kung (Figure 4). This network is essentially a pipelined
implementation of the defining equation of convolution. Since the network is a linear array, the unit grid
mentioned above reduces to the integer line. Similarly, the velocities and distortion matrices of the data
flows reduce to rational scalars.


Figure 4. General structure of the canonical convolver (R1)
and functional description of modules: $W_O \leftarrow W_I$, $X_O \leftarrow X_I$, $Y \leftarrow Y + W_I X_I$


There are three data flows in this network. The elements of W comprise an eastward data flow,
the parameters of which will be subscripted with w. The elements of X comprise a westward data
flow; parameters associated with it will be subscripted with x. The third data flow is stationary and
consists of the elements of Y, which are contained in the registers of the PEs; parameters associated with
this data flow will be subscripted with y. As a convention, we will define $\gamma$ for all sequences (and
vectors) such that $\gamma(a[i]) = i$. We then obtain these values for the data flow parameters:


canonical(R1)

\[ v_w = 1, \quad v_x = -1, \quad v_y = 0, \]
\[ L_w = 2, \quad L_x = 2, \quad L_y = 1. \]


If a network in which the weights are stationary is desired, we could simply subtract $v_w$ from all
the velocities of the data flows in the canonical design. The resulting data flow velocities are (distortion
matrices remain unchanged)

-1:  $v_w = 0, \quad v_x = -2, \quad v_y = -1.$

In this network (Figure 5), the weights are indeed stationary and inputs move in the same direction as
the results, but at twice the speed. Kung mentions this network as a "dual" of design W2, although it
now seems more closely related to R1. Transformation of the module descriptions is trivial since the
only PEs in which computation occurs are inner product step processors.


Figure  5.  General  structure  of  the  -1  convolver  and 
functional  description  of  modules 

A network with stationary inputs can be obtained by subtracting $v_x$ from the data flow velocities
of the canonical network. The data flow velocities are then

+1:  $v_w = 2, \quad v_x = 0, \quad v_y = 1.$

In  this  network  (Figure  6),  which  is  not  catalogued  by  Kung,  the  weights  move  in  the  same  direction  as 
the  results,  but  at  twice  the  speed. 


Figure 6. General structure of the +1 convolver and
functional description of modules (including $Y_O \leftarrow Y_I + W_I X$)


Another  canonical  convolution  network  catalogued  by  Kung  is  R2  (results  stay,  inputs  and 
weights  move  in  same  direction,  but  at  different  speeds).  This  design  (Figure  7)  has  the  following  data 
flow  parameters: 


canonical(R2) 


\[ v_w = 1/2, \quad v_x = 1, \quad v_y = 0, \]
\[ L_w = 1/2, \quad L_x = -1, \quad L_y = 1. \]
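As a consistency check on these R2 parameters, the sketch below verifies that W[j], X[i-j], and Y[i] occupy the same position at the same time, as the convolution sum requires. It assumes all origins are zero and uses the convention that the grid index of a[i] is i; the helper names `pos` and `meeting_time` are hypothetical.

```python
from fractions import Fraction as F

# (velocity, distortion) of each flow in the canonical R2 convolver.
R2 = {"w": (F(1, 2), F(1, 2)), "x": (F(1), F(-1)), "y": (F(0), F(1))}


def pos(flow, index, t):
    """phi(a[index], t) = L*index + t*v, with the origin taken as 0."""
    v, L = R2[flow]
    return L * index + t * v


def meeting_time(i, j):
    """Time at which W[j] reaches the (stationary) cell holding Y[i]."""
    v_w, L_w = R2["w"]
    return (pos("y", i, 0) - L_w * j) / v_w


checks = []
for i in range(6):
    for j in range(4):
        t = meeting_time(i, j)
        # All three operands of the term W[j] * X[i-j] of Y[i] must coincide.
        checks.append(pos("w", j, t) == pos("x", i - j, t) == pos("y", i, t))
print(all(checks))
```

Under these assumptions the meeting time works out to 2i - j, so successive terms of a given Y[i] arrive on distinct clock cycles.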


Figure 7. General structure of the canonical convolver (R2)
and functional description of modules: $W_O \leftarrow W_I$, $X_O \leftarrow X_I$, $Y \leftarrow Y + W_I X_I$;
a second module with $W_O \leftarrow W_I$


Kung mentions that this design has a "dual" in which the weights move twice as fast as the inputs.
This is clearly a result of the symmetry of convolution, which allows us to interchange the W and X
data flow parameters; in fact, the W and X data flow parameters of any systolic convolution network
can be interchanged to yield another valid convolution network.

Now,  simply  subtracting  vw  from  all  data  flow  velocities  in  the  canonical  network  yields  a  new 
convolution  network  in  which  the  weights  are  stationary.  The  resulting  data  flow  velocities  are 

-1/2:  $v_w = 0, \quad v_x = 1/2, \quad v_y = -1/2.$

This design is labeled W1 (weights stay, inputs and results move in opposite directions) by Kung
(Figure 8).


Figure 8. General structure of the -1/2 convolver and
functional description of modules: $X_O \leftarrow X_I$, $Y_O \leftarrow Y_I + W X_I$


Suppose  that  we  subtract  instead  vx  from  the  velocities  of  the  canonical  data  flows.  The  resulting 
data  flow  velocities  are 

-1:  $v_w = -1/2, \quad v_x = 0, \quad v_y = -1.$

In this network (Figure 9), the inputs are stationary, and the weights and results move in the same
direction but at different speeds. If we interchange the data flow parameters of W and X, we obtain the
design labeled W2 (weights stay, inputs and results move in the same direction but at different speeds)
by Kung.
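The velocity sets quoted in this section all follow from Theorem 1 by adding a single scalar to every velocity of a canonical design. The short sketch below reproduces them; the helper names `shift` and `swap_wx` and the dictionary layout are assumptions for illustration.

```python
from fractions import Fraction as F


def shift(velocities, u):
    """Theorem 1 for a linear array: add the same scalar u to every velocity."""
    return {k: v + u for k, v in velocities.items()}


def swap_wx(velocities):
    """Interchange the W and X flows (symmetry of convolution)."""
    return {"w": velocities["x"], "x": velocities["w"], "y": velocities["y"]}


r1 = {"w": F(1), "x": F(-1), "y": F(0)}       # canonical R1
r2 = {"w": F(1, 2), "x": F(1), "y": F(0)}     # canonical R2

stationary_weights_r1 = shift(r1, -r1["w"])   # the "-1" network
stationary_inputs_r1 = shift(r1, -r1["x"])    # the "+1" network
w1 = shift(r2, -r2["w"])                      # the "-1/2" network (W1)
stationary_inputs_r2 = shift(r2, -r2["x"])    # the "-1" network derived from R2
w2 = swap_wx(stationary_inputs_r2)            # W2 after interchanging W and X

for name, net in [("R1 - v_w", stationary_weights_r1),
                  ("R1 - v_x", stationary_inputs_r1),
                  ("R2 - v_w", w1),
                  ("R2 - v_x", stationary_inputs_r2),
                  ("W2", w2)]:
    print(name, {k: str(v) for k, v in net.items()})
```

Only velocities are tracked here because the distortion matrices are unchanged by Theorem 1.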

Figure 9. General structure of the -1 convolver and
functional description of modules: $Y_O \leftarrow Y_I + W_I X$, $W_O \leftarrow W_I$


We might also seek a transformation to demonstrate the equivalence of R1 and R2. However, the
equivalence of R1 and R2 is not an immediate consequence of affine transformations or the symmetry
of convolution. One can observe that the signs of $L_w$ and $L_x$ are the same for R1; for this reason, we
say that R1 is a cogradient convolution network. For R2, though, the signs of $L_w$ and $L_x$ are opposite;
for this reason, we say that R2 is a contragradient convolution network. Theorem 1 leaves the distortion
matrices unchanged, and Theorem 2 multiplies both $L_w$ and $L_x$ by the same nonzero scalar, so the
relative sign of $L_w$ and $L_x$ is invariant under affine transformation. Clearly, then, no affine
transformation or simple interchange of data flow parameters will map R1 to R2. Thus, it seems that all of the
known systolic convolution designs lie in two affine equivalence classes, the class of cogradient designs
and the class of contragradient designs. These designs are summarized in the table below (Figure 10).
Figure 10. Summary of the systolic convolution networks


3.4.2.  MATRIX MULTIPLICATION


Another problem for which systolic solutions have been proposed is that of matrix multiplication,
i.e., given two matrices A and B, find the product matrix C = AB. The canonical systolic network for
matrix multiplication is the planar array of "orthogonally-connected" PEs (Figure 11). In [11],
Preparata and Vuillemin describe this network (rotated -90°) and show that its operation may be
viewed as a pipelined interaction of columns of A with rows of B. This pipelined interaction of
columns with rows is a central feature of many systolic computations and will be further explored
later. Since this network is a planar array, the unit grid becomes planar, the velocities of the data flows
are two-component vectors, and the distortion matrices are 2 x 2 matrices.


Figure 11. General structure of the canonical ($[0\ 0]^T$) matrix multiplier
and functional description of modules


There are three data flows in this network. The northward data flow consists of the elements of
A. The eastward data flow consists of the elements of B. The third data flow is stationary and consists
of the elements of the product matrix, C, which are accumulated in the module registers. The
parameters of these data flows will be subscripted with a, b, and c, respectively. In this thesis, we will assume
that $\gamma$ is defined for all matrices such that $\gamma(a[i,j]) = [i\ j]^T$. The data flow parameters are

\[ \mathbf{v}_a = [0\ 1]^T, \quad \mathbf{v}_b = [1\ 0]^T, \quad \mathbf{v}_c = [0\ 0]^T, \]

\[ L_a = \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix}, \quad
   L_b = \begin{bmatrix} -1 & -1 \\ 0 & 1 \end{bmatrix}, \quad
   L_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]
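These parameters can be checked against the definition of the matrix product: the operands a[i,k] and b[k,j] must meet the accumulating element c[i,j] at a common position and time. The sketch below assumes zero origins (the text ignores origins) and an arbitrary problem size n = 3; the helper names `mat_vec` and `phi` are hypothetical.

```python
from itertools import product


def mat_vec(L, g):
    """Multiply a 2x2 matrix (list of rows) by an index pair g."""
    return tuple(sum(l * x for l, x in zip(row, g)) for row in L)


def phi(L, v, g, t):
    """Position L*g + t*v of the element with grid index g (origin taken as 0)."""
    Lg = mat_vec(L, g)
    return tuple(p + t * vi for p, vi in zip(Lg, v))


L_a, v_a = [[1, 0], [-1, -1]], (0, 1)   # northward flow of A
L_b, v_b = [[-1, -1], [0, 1]], (1, 0)   # eastward flow of B
L_c, v_c = [[1, 0], [0, 1]], (0, 0)     # stationary flow of C

n = 3
ok = all(
    phi(L_a, v_a, (i, k), i + j + k)
    == phi(L_b, v_b, (k, j), i + j + k)
    == phi(L_c, v_c, (i, j), i + j + k)
    for i, j, k in product(range(n), repeat=3)
)
print(ok)
```

With these assumptions the three elements coincide at position (i, j) at time i + j + k, so each c[i,j] receives one product term per clock cycle.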


A number of potentially useful linear equivalence classes of networks for matrix multiplication
may be derived by applying transformations to this network, as specified in Theorem 1. First, let us
consider the network obtained by adding the vector $[-1\ -1]^T$ to all velocities. The new velocities are
(again, distortion matrices remain unchanged)

\[ \mathbf{v}_a = [-1\ 0]^T, \quad \mathbf{v}_b = [0\ -1]^T, \quad \mathbf{v}_c = [-1\ -1]^T. \]

As in the case of convolution, all the active PEs in the original network are inner product step
processors. Therefore, all the PEs in this network are also inner product step processors. Because each PE has
six neighbors, the network (Figure 12) is said to be "hexagonally-connected."


Figure 12. General structure of the $[-1\ {-1}]^T$ matrix multiplier


Now, if we add the vector $[-1/2\ {-1/2}]^T$ to all velocities in the canonical network, we obtain a
network (Figure 13) that is again "orthogonally-connected," but communication along one of the axes is
now bidirectional. The network has these data flow velocities:

$$v_a = [-1/2\ \ 1/2]^T, \quad v_b = [1/2\ {-1/2}]^T, \quad v_c = [-1/2\ {-1/2}]^T.$$

Yet another network can be obtained by adding the vector $[-1/3\ {-1/3}]^T$ to all velocities in the
canonical network. The resulting network (Figure 14) is "hexagonally-connected" with the following data
flow velocities:

$$v_a = [-1/3\ \ 2/3]^T, \quad v_b = [2/3\ {-1/3}]^T, \quad v_c = [-1/3\ {-1/3}]^T.$$


Figure 14. General structure of the $[-1/3\ {-1/3}]^T$ matrix multiplier


It is important to note that each of the three matrix multiplication networks derived above is
representative of a linear equivalence class of systolic networks: each can be "redrawn" in a more
conventional or pleasing manner simply by multiplying the data flow parameters by the appropriate
matrix. For instance, suppose we let

$$M = \begin{bmatrix} -3/2 & 3/2 \\ -3 & -3 \end{bmatrix}.$$


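If we multiply the data flow parameters of the $[-1/3\ {-1/3}]^T$ network by M, we obtain the data flow
parameters of the Kung-Leiserson systolic matrix multiplier [8]:

$$v_a = [3/2\ {-1}]^T, \quad v_b = [-3/2\ {-1}]^T, \quad v_c = [0\ 2]^T,$$

$$L_a = \begin{bmatrix} -3 & -3/2 \\ 0 & 3 \end{bmatrix}, \quad
L_b = \begin{bmatrix} 3/2 & 3 \\ 3 & 0 \end{bmatrix}, \quad
L_c = \begin{bmatrix} -3/2 & 3/2 \\ -3 & -3 \end{bmatrix}.$$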

Note  that  this  network  (Figure  15),  when  specialized  to  the  case  of  banded  matrices  considered  by  Kung 
and  Leiserson,  has  a  parallelogram  shape,  as  did  their  network. 
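The two kinds of transformation used in this section can be checked mechanically. The following is a minimal Python sketch, not part of the original derivation: the dictionary layout and the helper names (`mat_vec`, `mat_mul`, `add_velocity`, `apply_matrix`) are assumptions introduced only for illustration. It adds $[-1/3\ {-1/3}]^T$ to the canonical velocities and then multiplies all velocities and distortion matrices by M, using exact rational arithmetic.

```python
from fractions import Fraction as F


def mat_vec(m, v):
    """Multiply a 2x2 matrix (list of rows) by a 2-component vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def mat_mul(m, n):
    """Multiply two 2x2 matrices given as lists of rows."""
    return [[sum(m[r][k] * n[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def add_velocity(flows, u):
    """Add u to every data flow velocity; distortion matrices are unchanged."""
    return {name: ([v[0] + u[0], v[1] + u[1]], L) for name, (v, L) in flows.items()}


def apply_matrix(flows, m):
    """Multiply every velocity and every distortion matrix by m."""
    return {name: (mat_vec(m, v), mat_mul(m, L)) for name, (v, L) in flows.items()}


# Canonical matrix multiplier: name -> (velocity, distortion matrix).
canonical = {
    "a": ([F(0), F(1)], [[F(1), F(0)], [F(-1), F(-1)]]),
    "b": ([F(1), F(0)], [[F(-1), F(-1)], [F(0), F(1)]]),
    "c": ([F(0), F(0)], [[F(1), F(0)], [F(0), F(1)]]),
}
M = [[F(-3, 2), F(3, 2)], [F(-3), F(-3)]]

shifted = add_velocity(canonical, [F(-1, 3), F(-1, 3)])
kung_leiserson = apply_matrix(shifted, M)
```

Under these assumptions, `kung_leiserson["a"]` holds the velocity $[3/2\ {-1}]^T$ together with the distortion matrix whose rows are $[-3\ {-3/2}]$ and $[0\ 3]$, in agreement with the Kung-Leiserson parameters given in the text.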


Figure  15.  General  structure  of  the  Kung-Leiserson  matrix  multiplier 
(generalized  to  dense  matrices) 


3.4.3. LU DECOMPOSITION

The solution of systems of linear equations is another problem that has been approached with
systolic techniques. This problem is usually posed in matrix form as follows: given a nonsingular n×n
matrix A and an n×m matrix C, find the n×m matrix B such that C = AB. The solution of this
problem, for general matrices, usually involves decomposing A into triangular factors and then solving
triangular linear systems, both of which can be done directly with systolic designs. Kung and Leiserson
have presented a systolic network for LU decomposition in [8]. This network (Figure 16) computes two
matrices, L and U, such that the input matrix, A, can be expressed as A = LU, where L is unit lower
triangular and U is upper triangular.

In the Kung-Leiserson design, the matrix A flows northward into a network of
"hexagonally-connected" processors. The L and U matrices may be retrieved from the network in a variety of ways;
we will choose an implementation of the Kung-Leiserson design that more clearly exhibits network
symmetry: L flows southeast from the lower-right margin of the network, and U flows southwest
from the lower-left margin of the network. These three matrices comprise the data flows in the
network. The corresponding data flow parameters will be subscripted with a, l, and u, respectively.
Analysis of the network yields the following values for the data flow velocities and distortion
matrices:

$$v_l = [3/2\ {-1}]^T, \quad v_u = [-3/2\ {-1}]^T, \quad v_a = [0\ 2]^T,$$

$$L_l = \begin{bmatrix} -3 & -3/2 \\ 0 & 3 \end{bmatrix}, \quad
L_u = \begin{bmatrix} 3/2 & 3 \\ 3 & 0 \end{bmatrix}, \quad
L_a = \begin{bmatrix} -3/2 & 3/2 \\ -3 & -3 \end{bmatrix}.$$

Figure 16. The Kung-Leiserson network for LU decomposition


There are four module types in this network: the PE at the top of the network is a T-module;
the PEs at the upper-left margin (except the top) are S-modules; the PEs at the upper-right margin
(except the top) are R-modules; all of the other PEs are G-modules. The module descriptions are given
below (Figure 17).

    module        assignment statements           control statements

    T (top)       L ← 1, U ← A
    S (u.-l.)     L ← A/U, U ← U
    R (u.-r.)     L ← L, U ← A
    G (others)    L ← L, U ← U, A ← A - LU

Figure  17.  Module  descriptions  for  the  Kung-Leiserson  LU  decomposer 


Now, we will derive a canonical network for LU decomposition. In this computation, the
operation is matrix multiplication, the operands are the L and U matrices, and the result is the matrix A.
The side conditions are that L is unit lower triangular and U is upper triangular. Therefore, the
canonical design is the one in which the A data flow is stationary and has an identity distortion matrix. It is
obtained, first, by adding $-v_a = [0\ {-2}]^T$ to all data flow velocities and, second, by multiplying all data
flow velocities and distortion matrices by $L_a^{-1}$. (Note that the Kung-Leiserson network is a member of
the $L_a^{-1} v_a = [-1/3\ {-1/3}]^T$ linear equivalence class.) The results of the transformation are as follows:

$$v_l' = [0\ 1]^T, \quad v_u' = [1\ 0]^T, \quad v_a' = [0\ 0]^T,$$

$$L_l' = \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix}, \quad
L_u' = \begin{bmatrix} -1 & -1 \\ 0 & 1 \end{bmatrix}, \quad
L_a' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

This  canonical  network  is  shown  below  (Figure  18). 

We now must describe the functions of the PEs in the canonical network. In all of the previously
discussed transformations, obtaining the module descriptions was trivial since the networks had only
one module type and that module had a single state. In this case, however, the state flow technique
must be employed. In the Kung-Leiserson network, the PEs do not change state, so the state flow has
zero velocity ($v_s = [0\ 0]^T$). If we take $G_s$ to be the convex grid set below (Figure 19), then

$$L_s = \begin{bmatrix} -3/2 & 3/2 \\ -1 & -1 \end{bmatrix}.$$


Figure 18. The canonical network for LU decomposition


Figure 19. The convex grid set underlying the state flow


Transforming  the  state  flow  parameters  as  we  did  the  data  flow  parameters,  we  obtain: 


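$$v_s' = L_a^{-1}(v_s - v_a)
= \begin{bmatrix} -1/3 & -1/6 \\ 1/3 & -1/6 \end{bmatrix}
  \begin{bmatrix} 0 \\ -2 \end{bmatrix}
= \begin{bmatrix} 1/3 \\ 1/3 \end{bmatrix},$$

$$L_s' = L_a^{-1} L_s
= \begin{bmatrix} -1/3 & -1/6 \\ 1/3 & -1/6 \end{bmatrix}
  \begin{bmatrix} -3/2 & 3/2 \\ -1 & -1 \end{bmatrix}
= \begin{bmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix}.$$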


We  will  derive  the  module  descriptions  for  the  canonical  network  from  this  transformed  state  flow 
(Figure  20). 


Figure 20. The state flow of the canonical LU decomposer
(at  first  clock  cycle) 
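The derivation of the canonical LU parameters, including the state flow, can be reproduced numerically. The sketch below is illustrative only: the function name `canonicalize` and the dictionary layout are assumptions, and the distortion matrices assigned to the L and U flows are taken to be those of the Kung-Leiserson matrix multiplier, which is consistent with the canonical results stated in the text. The routine subtracts the velocity of the result flow from every velocity and then multiplies all parameters by the inverse of the result flow's distortion matrix.

```python
from fractions import Fraction as F


def mat_vec(m, v):
    """Multiply a 2x2 matrix (list of rows) by a 2-component vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def mat_mul(m, n):
    """Multiply two 2x2 matrices given as lists of rows."""
    return [[sum(m[r][k] * n[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def inverse2(m):
    """Invert a nonsingular 2x2 matrix of Fractions."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("distortion matrix must be nonsingular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def canonicalize(flows, result):
    """Make the `result` flow stationary with an identity distortion matrix."""
    v_r, L_r = flows[result]
    inv = inverse2(L_r)
    return {name: (mat_vec(inv, [v[0] - v_r[0], v[1] - v_r[1]]), mat_mul(inv, L))
            for name, (v, L) in flows.items()}


# Kung-Leiserson LU decomposer: name -> (velocity, distortion matrix).
# "s" is the state flow; the "l" and "u" matrices are assumed (see lead-in).
kl_lu = {
    "l": ([F(3, 2), F(-1)], [[F(-3), F(-3, 2)], [F(0), F(3)]]),
    "u": ([F(-3, 2), F(-1)], [[F(3, 2), F(3)], [F(3), F(0)]]),
    "a": ([F(0), F(2)], [[F(-3, 2), F(3, 2)], [F(-3), F(-3)]]),
    "s": ([F(0), F(0)], [[F(-3, 2), F(3, 2)], [F(-1), F(-1)]]),
}
canonical_lu = canonicalize(kl_lu, "a")
```

With these inputs, `canonical_lu["s"]` contains the state flow velocity $[1/3\ 1/3]^T$ and the distortion matrix with rows $[2/3\ {-1/3}]$ and $[-1/3\ 2/3]$, matching the transformed state flow parameters.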


We can observe that the PEs on the diagonal of the network (those of the form M[i,i]) can only
enter states G1 and T1. The PEs below the diagonal (those of the form M[i,j], i>j) can enter only G1
and S1, and the PEs above the diagonal (those of the form M[i,j], i<j) can enter only G1 and R1. This
suggests three new module types: D-modules (M[i,i]) with states D1 = G1 and D2 = T1, L-modules
(M[i,j], i>j) with states L1 = G1 and L2 = S1, and U-modules (M[i,j], i<j) with states U1 = G1 and
U2 = R1. (Here, equality of states indicates computational equivalence, i.e., the assignment statements
of the states are the same.)


We will now determine a hard-wired state transition scheme that realizes this state flow. Since
the only part of the state flow that is crucial to the operation of the network is that which coincides
with the data flows, we will find it convenient to assume that all PEs are initially in the state
corresponding to G1, i.e., D-modules are in state D1, L-modules are in L1, and U-modules are in U1.
This state acts as a quasi-quiescent state, in the sense that the register contents of a PE in this state are
not altered unless both the L and U data flows are flowing through the PE. The first PE to be excited
from this state is M[1,1], which enters state D2. This must be accomplished via an external signal.
After each M[i,i] enters this state, the processor immediately to the right of it enters L2, and the
processor immediately above it enters U2. Therefore, we must include in the control statements of state D2 a
device to indicate that these state transitions occur. We do this with "goto" control statements: a goto
preceded by a PE reference denotes a state transition control signal to be sent to a neighboring PE; one
without a PE reference denotes a state transition to be executed by the PE itself. So, for state D2, we
include "M[i+1,i] goto L2" and "M[i,i+1] goto U2." The state L2 propagates to the east, so its control
statements include "M[i+1,j] goto L2." Similarly, the state U2 propagates to the north, so its control
statements include "M[i,j+1] goto U2." Immediately following these second states, the PEs may return
to the quasi-quiescent state, so we include in the control statements of D2, L2, and U2 "goto D1," "goto
L1," and "goto U1," respectively. The only remaining issue, then, is to determine a mechanism for D2 to
propagate along the network diagonal. There are several possible solutions; however, the most
straightforward of these is to include in the control statements of state D2 "M[i+1,i+1] goto D2 in 3." By this,
we mean that M[i+1,i+1] is to transition to state D2 in three clock cycles. This can be achieved by
buffering the control signal with a two-stage shift register. The module descriptions that finally
emerge are shown below (Figure 21).


    module             state    assignment statements          control statements
    type               label

    D (M[i,i])         D1:      L ← L, U ← U, A ← A - LU
                       D2:      L ← 1, U ← A                   M[i+1,i] goto L2,
                                                               M[i,i+1] goto U2,
                                                               M[i+1,i+1] goto D2 in 3,
                                                               goto D1

    L (M[i,j], i>j)    L1:      L ← L, U ← U, A ← A - LU
                       L2:      L ← A/U, U ← U                 M[i+1,j] goto L2,
                                                               goto L1

    U (M[i,j], i<j)    U1:      L ← L, U ← U, A ← A - LU
                       U2:      L ← L, U ← A                   M[i,j+1] goto U2,
                                                               goto U1

Figure  21.  Module  descriptions  for  the  canonical  LU  decomposer 
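The timing implied by these control statements can be explored with a small event-driven simulation. The sketch below is an illustration rather than a description of any actual hardware: the function name `second_state_entry_times` is hypothetical, and it assumes that M[1,1] is excited externally at cycle 0, that a plain "goto" signal takes effect on the next clock cycle, and that "goto D2 in 3" takes effect three cycles after it is issued.

```python
def second_state_entry_times(n):
    """Cycle at which each PE M[i,j] (1 <= i, j <= n) enters D2, L2, or U2.

    Assumptions: external excitation of M[1,1] at cycle 0; a plain goto
    arrives one cycle later; "goto D2 in 3" arrives three cycles later.
    """
    times = {}
    signals = [(0, 1, 1)]  # (arrival cycle, i, j)
    while signals:
        t, i, j = signals.pop()
        if i > n or j > n:
            continue  # the signal leaves the array
        times[(i, j)] = t
        if i == j:      # D-module entering D2
            signals.append((t + 1, i + 1, j))      # M[i+1,i] goto L2
            signals.append((t + 1, i, j + 1))      # M[i,i+1] goto U2
            signals.append((t + 3, i + 1, j + 1))  # M[i+1,i+1] goto D2 in 3
        elif i > j:     # L-module entering L2
            signals.append((t + 1, i + 1, j))      # M[i+1,j] goto L2
        else:           # U-module entering U2
            signals.append((t + 1, i, j + 1))      # M[i,j+1] goto U2
    return times


times = second_state_entry_times(4)
```

Under these timing assumptions, each PE receives exactly one excitation signal, and M[i,j] enters its second state at cycle i + j + min(i,j) - 3. The diagonal state therefore advances one PE every three cycles, which agrees with a state flow velocity of $[1/3\ 1/3]^T$.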


3.4.4.  TRIANGULAR  SYSTEM  SOLUTION 


As was mentioned previously, the solution of triangular systems of linear equations is an
important component in the problem of solving more general systems of linear equations. The
lower-triangular variant of this problem is the following: given a nonsingular, lower-triangular matrix L
and a matrix Y, find the matrix X such that Y = LX. In [8], Kung and Leiserson propose a systolic
network for solving this problem in the special case where Y and X are column vectors. Here, we make
minor modifications to their network and generalize it (the details are omitted for brevity) to obtain
the network below (Figure 22), which solves lower-triangular linear systems in full generality (Y and
X are matrices).


Figure  22.  A  systolic  network  for  triangular  system  solution 


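The matrix L flows southward into the "orthogonally-connected" network of processors, Y flows
westward into the network, and X flows eastward from the network. These, in fact, are the three data
flows in the network; their parameters will be subscripted with l, y, and x, respectively. The values are

$$v_l = [0\ {-1}]^T, \quad v_x = [1\ 0]^T, \quad v_y = [-1\ 0]^T,$$

$$L_l = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \quad
L_x = \begin{bmatrix} -2 & -1 \\ 0 & -1 \end{bmatrix}, \quad
L_y = \begin{bmatrix} 2 & 1 \\ 0 & -1 \end{bmatrix}.$$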


There are two module types in this network: the PEs at the left margin are D-modules; the
remaining ones are M-modules. The module descriptions are given below (Figure 23). Initially, all PEs
are in a quiescent state, state 0. To initiate the computation, state D1 is externally excited in the
uppermost D-module (M[1,p]).


    module             state    assignment statements          control statements
    type               label

    D (M[1,j])         D1:      L ← L, X ← Y/L                 M[1,j-1] goto D1,
                                                               M[2,j] goto M1,
                                                               goto D2
                       D2:      L ← L, X ← Y/L

    M (M[i,j], i>1)    M1:      L ← L, X ← X, Y ← Y - LX       M[i+1,j] goto M1,
                                                               goto M2
                       M2:      L ← L, X ← X, Y ← Y - LX


Figure  23.  Module  descriptions  for  the  triangular  system  solver 

To derive the canonical network, we observe that, for this computation, the operation is again
matrix multiplication, the operands are the L and X matrices, and the result is the matrix Y. Therefore,
we add $-v_y = [1\ 0]^T$ to all data flow velocities and then multiply all data flow velocities and
distortion matrices by $L_y^{-1}$. This implies that the original network is a representative of the
$L_y^{-1} v_y = [-1/2\ 0]^T$ linear equivalence class. The parameters that we obtain for the canonical network
are

$$v_l' = [0\ 1]^T, \quad v_x' = [1\ 0]^T, \quad v_y' = [0\ 0]^T,$$

$$L_l' = \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix}, \quad
L_x' = \begin{bmatrix} -1 & -1 \\ 0 & 1 \end{bmatrix}, \quad
L_y' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$


Once again, the canonical network has an "orthogonally-connected" architecture (Figure 24).


Figure 24. The canonical network for triangular system solution


Here too, we must employ the state flow technique to obtain the module descriptions for the
canonical network. The first step is to obtain a state flow for the original $[-1/2\ 0]^T$ network.
Observing that state D1 is first excited in M[1,p] and then propagates to M[1,p-1] and then to M[1,p-2] and so
forth suggests a state flow headed by D1 with velocity $[0\ {-1}]^T$. In fact, since D1 = D2 and M1 = M2,
the state flow is essentially the same as the L data flow, with state D1 occupying the positions of the
diagonal elements and state M1 occupying the other positions (Figure 25). If we choose $G_s$ as shown
below (Figure 26), then we have these state flow parameters:

$$v_s = [0\ {-1}]^T, \quad L_s = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}.$$

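Performing the transformation on these parameters:

$$v_s' = L_y^{-1}(v_s - v_y)
= \begin{bmatrix} 1/2 & 1/2 \\ 0 & -1 \end{bmatrix}
  \left( \begin{bmatrix} 0 \\ -1 \end{bmatrix} - \begin{bmatrix} -1 \\ 0 \end{bmatrix} \right)
= \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$

$$L_s' = L_y^{-1} L_s
= \begin{bmatrix} 1/2 & 1/2 \\ 0 & -1 \end{bmatrix}
  \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix}.$$

We will derive the module descriptions for the canonical network from this transformed state flow
(Figure 27).


Figure 27. The state flow of the canonical triangular system solver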
(at  fourth  clock  cycle) 


We now note that the PEs on the left margin of the network (those of the form M[1,j]) only enter
state D1. All other PEs (those of the form M[i,j], i>1) can enter state D1 or M1. This suggests two new
module types: A-modules (M[1,j]) with a single state A1 = D1 and B-modules (M[i,j], i>1) with states
B1 = M1 and B2 = D1.

As before, we must determine a hard-wired state transition scheme that realizes this state flow.
We assume that all PEs are initially in the quiescent state, state 0. The first PE to be excited from this
state (via external signal) is M[1,1], which enters state A1. State A1 then propagates northward, so we
include "M[1,j+1] goto A1" in the control statements of state A1. One cycle after each A-module enters
state A1, the B-module immediately to its right enters state B1; two cycles after, the B-module enters
state B2. Therefore, we also include "M[2,j] goto B1" and "M[2,j] goto B2 in 2" in the control statements
of state A1. State B1 then propagates eastward, so we include "M[i+1,j] goto B1" in the control
statements of state B1. State B2 also propagates eastward, but at a rate of 1/2 processor per cycle.
Therefore, we also include "M[i+1,j] goto B2 in 2" in the control statements of state B2. All PEs are
appropriately returned to the quiescent state by including "goto 0" in the control statements of states
A1 and B2. Thus, we derive the module descriptions below (Figure 28).


    module             state    assignment statements          control statements
    type               label

    A (M[1,j])         A1:      L ← L, X ← Y/L                 M[1,j+1] goto A1,
                                                               M[2,j] goto B1,
                                                               M[2,j] goto B2 in 2,
                                                               goto 0

    B (M[i,j], i>1)    B1:      L ← L, X ← X, Y ← Y - LX       M[i+1,j] goto B1
                       B2:      L ← L, X ← Y/L                 M[i+1,j] goto B2 in 2,
                                                               goto 0

Figure  28.  Module  descriptions  for  the  canonical  triangular  system  solver 
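The control statements of Figure 28 determine when each PE enters each state. The following sketch is illustrative only: the function name `solver_state_entry_times` is hypothetical, and it assumes that M[1,1] is excited at cycle 0, that a plain "goto" arrives one cycle after it is issued, and that "goto ... in 2" arrives two cycles after it is issued.

```python
def solver_state_entry_times(n, p):
    """Entry cycles of states A1, B1, B2 for PEs M[i,j], 1 <= i <= n, 1 <= j <= p.

    Returns {(i, j): {state_label: cycle}}. Assumptions: M[1,1] is excited
    at cycle 0; a plain goto takes one cycle; "in 2" takes two cycles.
    """
    entered = {}
    signals = [(0, 1, 1, "A1")]  # (arrival cycle, i, j, state)
    while signals:
        t, i, j, state = signals.pop()
        if i > n or j > p:
            continue  # the signal leaves the array
        entered.setdefault((i, j), {})[state] = t
        if state == "A1":
            signals.append((t + 1, 1, j + 1, "A1"))  # M[1,j+1] goto A1
            signals.append((t + 1, 2, j, "B1"))      # M[2,j] goto B1
            signals.append((t + 2, 2, j, "B2"))      # M[2,j] goto B2 in 2
        elif state == "B1":
            signals.append((t + 1, i + 1, j, "B1"))  # M[i+1,j] goto B1
        else:  # state == "B2"
            signals.append((t + 2, i + 1, j, "B2"))  # M[i+1,j] goto B2 in 2
    return entered


entered = solver_state_entry_times(4, 3)
```

Under these assumptions, a B-module M[i,j] enters B1 at cycle i + j - 2 and B2 at cycle 2i + j - 3, so it spends i - 1 cycles in the multiply-accumulate state B1 before the division in B2. This is consistent with forward substitution, in which row i of the solution requires i - 1 inner product steps before the division by the diagonal element.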


An interesting transformation of the canonical triangular system solver is one from which the
resulting state flow has zero velocity. The networks with zero state flow velocity are those of the
$-v_s' = [0\ {-1}]^T$ linear equivalence class and have the property that no PE changes state. These networks
therefore have two module types, one corresponding to state A1 = B2 and one corresponding to state B1.
Since no state transitions occur, the state descriptions of these modules do not contain any control
statements, i.e., no control signals are transmitted through the network.

To exemplify such a network, we will add $-v_s'$ to all velocities and then multiply all velocities
and distortion matrices by $M = (L_s')^{-1}$. The resulting parameters are

$$v_l'' = [0\ 0]^T, \quad v_x'' = [1\ 0]^T, \quad v_y'' = [0\ 1]^T, \quad v_s'' = [0\ 0]^T,$$

$$L_l'' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
L_x'' = \begin{bmatrix} -1 & -1 \\ 1 & 0 \end{bmatrix}, \quad
L_y'' = \begin{bmatrix} 1 & 0 \\ -1 & -1 \end{bmatrix}, \quad
L_s'' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

The resulting network is shown (Figure 29), and the module descriptions are given (Figure 30). Again,
we note that the choice of M does not alter the topology of the network; it is simply chosen so that the
network can be drawn in a more convenient manner (in this case, with the same layout as the
underlying convex grid set, $G_s$).


    module             state    assignment statements          control statements
    type               label

    E (M[i,i])         E1:      L ← L, X ← Y/L
    F (M[i,j], i>j)    F1:      L ← L, X ← X, Y ← Y - LX

Figure 30. Module descriptions for the $[0\ {-1}]^T$ triangular system solver


The previous discussion of systolic triangular system solvers was pertinent to the lower-triangular
variant of the problem. The following trivial observation, however, allows us to apply these results to
the solution of upper-triangular systems. If U is a nonsingular, upper-triangular matrix, then solving
the linear system W = UV can be reduced to solving the lower-triangular system Y = LX, where
$L = RUR^{-1}$, Y = RW, X = RV, and R is a reverse permutation matrix, i.e.,

$$R = \begin{bmatrix}
0 & \cdots & 0 & 1 \\
\vdots & & 1 & 0 \\
0 & 1 & & \vdots \\
1 & 0 & \cdots & 0
\end{bmatrix}.$$

Another modification of this problem is to set Y = I (by hardwiring, perhaps) in order to handle an
important special case, triangular matrix inversion. One such systolic network (for upper-triangular
matrices) has been proposed by Preparata and Vuillemin [10].
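The reduction of an upper-triangular system to a lower-triangular one can be sketched in a few lines of Python. This is an illustration of the algebra only, not of the systolic network: `solve_lower` is an ordinary sequential forward substitution standing in for the lower-triangular solver, and all function names are assumptions. Multiplying on the left by R reverses the row order, and since $R^{-1} = R$, forming $RUR^{-1}$ reverses both the rows and the columns of U.

```python
from fractions import Fraction as F


def reverse_rows(m):
    """Left-multiply by the reverse permutation matrix R (reverses row order)."""
    return [list(row) for row in reversed(m)]


def conjugate_by_reversal(m):
    """Compute R m R^{-1}: reverse both the row order and the column order."""
    return [list(reversed(row)) for row in reversed(m)]


def solve_lower(L, Y):
    """Forward substitution for Y = L X with L nonsingular and lower triangular."""
    n, cols = len(L), len(Y[0])
    X = [[F(0)] * cols for _ in range(n)]
    for i in range(n):
        for j in range(cols):
            acc = F(Y[i][j]) - sum(F(L[i][k]) * X[k][j] for k in range(i))
            X[i][j] = acc / F(L[i][i])
    return X


def solve_upper(U, W):
    """Solve W = U V by reduction to the lower-triangular problem Y = L X."""
    L = conjugate_by_reversal(U)   # L = R U R^{-1} is lower triangular
    Y = reverse_rows(W)            # Y = R W
    X = solve_lower(L, Y)          # X = R V
    return reverse_rows(X)         # V = R^{-1} X = R X


U = [[2, 1, 3], [0, 4, 5], [0, 0, 6]]
W = [[13, -2], [23, -1], [18, -6]]
V = solve_upper(U, W)
```

In this example W was formed as the product of U with the matrix whose rows are [1, 0], [2, 1], and [3, -1], and `solve_upper` recovers that matrix exactly.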


3.5.  CROSSINGS  IN  TRANSFORMED  NETWORKS 

3.5.1.  NECESSARY  AND  SUFFICIENT  CONDITIONS  FOR  CROSSING 

As one experiments with these transformations on two-dimensional systolic networks, one may
observe that many of them result in networks with communication edges that cross. Formally, a
crossing is the intersection of two nonparallel communication edges that have distinct endpoints. For
instance, if we transform the canonical matrix multiplier by adding the vector $[-1/4\ {-1/4}]^T$ to all
data flow velocities, we obtain a systolic network with crossings (Figure 31). Networks with such
crossings are somewhat undesirable since they are no longer planarly embeddable in the grid. This
provides the motivation to characterize the conditions that cause crossing.


Figure 31. A $[-1/4\ {-1/4}]^T$ matrix multiplier

We  now  establish  a  necessary  condition  for  crossings  in  a  systolic  network. 

Lemma  1: 

In a connected systolic network (one in which the underlying undirected graph is
connected) with m data flows having velocities $v_1, v_2, \ldots, v_m$, a crossing exists only if there
exists an m-component vector, x, in the null space of $V = [v_1 \cdots v_m]$ with exactly one or
two noninteger components corresponding to nonzero, linearly independent columns of V.


Proof: Consider two edges e, $(p_e, p_e + v_i)$, and f, $(p_f, p_f + v_j)$, that cross at a point p. We
then have the following:

$$p_e + \xi_e v_i = p_f + \xi_f v_j = p, \quad \text{where } \xi_e, \xi_f \in [0,1],$$

$$p_e - p_f + \xi_e v_i - \xi_f v_j = 0.$$

Because $p_f$ and $p_e$ are both vertices of the systolic network, and because the systolic network
is connected, some sequence of edges (some possibly traversed in the reverse direction) will
form a path from $p_e$ to $p_f$. Therefore, $p_e - p_f = Vx'$, where x' is a vector of integers and

$$Vx' + \xi_e v_i - \xi_f v_j = 0.$$

If we let $x = x' + \xi_e e_i - \xi_f e_j$, where $e_k$ is the unit vector with the kth coordinate equal to 1,
then Vx = 0, i.e., x is in the null space of V. Since e and f have distinct endpoints,
$\xi_e \in (0,1)$ or $\xi_f \in (0,1)$, so x has one or two noninteger components. If x has one noninteger
component, that component must not correspond to a zero column of V; otherwise, the
distinct endpoint condition is violated. If x has two noninteger components, then they must
not correspond to linearly dependent columns of V since, by definition, a crossing cannot
occur between two parallel edges. □

We  can  also  establish  a  sufficient  condition  for  crossings  in  a  systolic  network. 

Lemma  2: 

In a systolic network in which each nonboundary PE has communication edges
corresponding to every data flow, a crossing exists if there exists an m-component vector, x, in the null
space of V with exactly one or two noninteger components corresponding to nonzero,
linearly independent columns of V.

Proof: Suppose that x is a vector with two noninteger components, $x_i$ and $x_j$, satisfying the
above criteria. Let $\xi_e = x_i - \lfloor x_i \rfloor$, $\xi_f = \lceil x_j \rceil - x_j$, and $x' = x - \xi_e e_i + \xi_f e_j$; x' is therefore a
vector of integers. Let $p_e$ be the position of an arbitrary nonboundary PE such that
$p_f = p_e - Vx'$ lies within the boundaries of the systolic network. Since each nonboundary
PE has communication edges corresponding to every data flow, a simple inductive argument
shows that any integer linear combination of the data flow velocities, when added to $p_e$,
yields the position of another PE (as long as this combination lies within the boundaries of
the network). In particular, then, $p_f$ is also the position of a nonboundary PE. The
following argument shows that edge e, $(p_e, p_e + v_i)$, must cross edge f, $(p_f, p_f + v_j)$:

$$p_f = p_e - Vx'$$

$$p_f = p_e - V(x - \xi_e e_i + \xi_f e_j)$$

$$p_f = p_e - Vx + \xi_e V e_i - \xi_f V e_j$$

$$p_f = p_e + \xi_e v_i - \xi_f v_j$$

$$p_f + \xi_f v_j = p_e + \xi_e v_i$$

A similar argument can be applied when x has only one noninteger component. □

Combining  these  two  results  in  a  trivial  manner  yields  the  following  theorem. 

Theorem  3: 

In  a  connected  systolic  network  in  which  each  nonboundary  PE  has  communication  edges 
corresponding  to  every  data  flow,  a  crossing  exists  if  and  only  if  there  exists  an  m  com¬ 
ponent  vector,  x,  in  the  null  space  of  V  with  exactly  one  or  two  noninteger  components 
corresponding  to  nonzero,  linearly  independent  columns  of  V. 
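The criteria of Theorem 3 are easy to test for a given candidate vector. The sketch below is a checker only, under stated assumptions: it handles two-dimensional networks (V has two rows), it verifies a supplied x rather than searching the null space, and the function name `satisfies_crossing_criteria` is hypothetical.

```python
from fractions import Fraction as F


def satisfies_crossing_criteria(V, x):
    """Check whether x meets the Theorem 3 criteria for the 2 x m matrix V.

    V is given as two rows; x is an m-component vector. The function tests
    one candidate x; it does not search the null space of V.
    """
    x = [F(c) for c in x]
    rows = [[F(c) for c in row] for row in V]
    # x must lie in the null space of V.
    if any(sum(r[k] * x[k] for k in range(len(x))) != 0 for r in rows):
        return False
    noninteger = [k for k, c in enumerate(x) if c.denominator != 1]
    if len(noninteger) == 1:
        k = noninteger[0]
        return rows[0][k] != 0 or rows[1][k] != 0       # nonzero column
    if len(noninteger) == 2:
        i, j = noninteger
        det = rows[0][i] * rows[1][j] - rows[0][j] * rows[1][i]
        return det != 0                                  # independent columns
    return False


# Canonical matrix multiplier: columns are v_a, v_b, v_c.
V_canonical = [[0, 1, 0], [1, 0, 0]]
# The same network after adding [-1/4, -1/4] to every velocity.
V_quarter = [[F(-1, 4), F(3, 4), F(-1, 4)], [F(3, 4), F(-1, 4), F(-1, 4)]]

canonical_hit = satisfies_crossing_criteria(V_canonical, [0, 0, F(1, 2)])
quarter_hit = satisfies_crossing_criteria(V_quarter, [F(1, 2), F(1, 2), 1])
```

For the canonical network the only noninteger component falls on a zero column, so the check fails; for the $[-1/4\ {-1/4}]^T$ network the vector $[1/2\ 1/2\ 1]^T$ passes, in line with the crossings noted for Figure 31.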

Now,  we  will  utilize  this  result  to  study  the  affine  transformations  described  earlier  and  their  effect  on 
crossing  in  systolic  networks. 

3.5.2. THE EFFECT OF AFFINE TRANSFORMATIONS ON CROSSING

First, let us examine the effect of Theorem 1 transformations on crossing. Suppose we are given a
crossing-free systolic network with V as its matrix of data flow velocities. Let x be a vector with
$S_x = \sum_k x_k \neq 0$ and two noninteger components, $x_i$ and $x_j$. We can select a u that induces crossings
when added to all data flow velocities in the following manner. Since the original network is
crossing-free, $Vx \neq 0$. If V' represents the transformed matrix of data flow velocities, then
V' = V + U, where $U = [u\ u\ \cdots\ u]$.

$$V'x = (V + U)x$$

$$V'x = Vx + Ux$$

$$V'x = Vx + S_x u$$

Therefore, if we choose $u = -Vx/S_x$, then V'x = 0. This implies that a crossing exists unless $v_i'$ is
parallel to $v_j'$, which, in general, is not the case. This demonstrates the fact that a crossing-free systolic
network may be transformed to one with crossings according to the rules of Theorem 1. Since this
transformation is invertible, it also follows that a systolic network with crossings may be transformed
to a crossing-free one. Thus, crossing-freedom is not invariant with respect to Theorem 1 transformations
(which was expected since our example of a systolic network with crossings was derived by this type
of transformation on a crossing-free systolic network).
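The choice $u = -Vx/S_x$ can be illustrated with a short computation. In this sketch the function names and the particular vector x are assumptions chosen for illustration; V is the velocity matrix of the canonical matrix multiplier, with columns $v_a$, $v_b$, and $v_c$.

```python
from fractions import Fraction as F


def crossing_inducing_offset(V, x):
    """Return u = -Vx / S_x, where S_x is the sum of the components of x."""
    s = sum(x)
    if s == 0:
        raise ValueError("the components of x must not sum to zero")
    return [-sum(row[k] * x[k] for k in range(len(x))) / s for row in V]


def shift(V, u):
    """Add u to every column of V (u added to every data flow velocity)."""
    return [[c + u[r] for c in row] for r, row in enumerate(V)]


V = [[F(0), F(1), F(0)], [F(1), F(0), F(0)]]   # canonical matrix multiplier
x = [F(1, 2), F(3, 2), F(-1)]                  # illustrative choice, S_x = 1
u = crossing_inducing_offset(V, x)
V_shifted = shift(V, u)
residual = [sum(row[k] * x[k] for k in range(3)) for row in V_shifted]
```

For this x the offset is $u = [-3/2\ {-1/2}]^T$, and `residual` is the zero vector, so x lies in the null space of the shifted velocity matrix.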

Now, we will examine the effect of Theorem 2 transformations on crossing. Let V be the matrix
of data flow velocities of the original network, and V' = MV be the matrix of data flow velocities of
the resulting network. If Vx = 0, then V'x = MVx = M0 = 0. If $Vx = w \neq 0$, then
$V'x = MVx = Mw \neq 0$ since M is nonsingular (according to Theorem 2). Thus, crossing-freedom is
invariant with respect to Theorem 2 transformations.

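As an example, suppose we apply Theorem 3 to a suitably restricted systolic network with the
following matrix of data flow velocities:

$$V = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \quad \text{(for instance, the canonical matrix multiplier).}$$

If Vx = 0, then $x_1 = 0$, $x_2 = 0$, and $x_3$ is unconstrained, so we may choose $x_3$ to be noninteger. This
component, however, corresponds to $v_3$, which is a zero column of V. Thus, no x can be chosen to
satisfy the criteria of Theorem 3, and the network is crossing-free, as is evident (Figure 7). Now, let us
transform this network by adding $u = [-3/2\ {-1/2}]^T$ to all data flow velocities:

$$V' = \begin{bmatrix} -3/2 & -1/2 & -3/2 \\ 1/2 & -1/2 & -1/2 \end{bmatrix}.$$

The vector $x = [1/2\ \ 3/2\ {-1}]^T$ lies in the null space of V', and its two noninteger components
correspond to linearly independent columns of V'. Thus, x satisfies the criteria of
Theorem 3, and the transformation results in crossings (Figure 32).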


Figure 32. A $[-3/2\ {-1/2}]^T$ matrix multiplier (showing crossings)


3.5.3. ENUMERATION OF CROSSING-FREE CANONICAL CLASSES

The previous discussion leads us naturally to ask which of the linear equivalence classes of matrix
multipliers are crossing-free. When we perform a Theorem 1 transformation on the canonical design, we
obtain a new matrix of data flow velocities, V' = V + U, i.e.,

$$V' = \begin{bmatrix} u_1 & 1+u_1 & u_1 \\ 1+u_2 & u_2 & u_2 \end{bmatrix}.$$

From Theorem 3, we know that crossings will exist if and only if there exists a vector in the null space
of V' with one or two noninteger components. The null space of V' is a one-dimensional subspace of
$\mathbb{R}^3$ as long as V' has full rank, which must be the case. Therefore, the null space of V' can be
represented as the range space of a 3×1 matrix, W, where V'W = 0. If we partition V' into $[V_1'\ V_2']$,
where $V_1' = v_1'$ and $V_2' = [v_2'\ v_3']$, and partition W conformably into $W_1 = w_1$ and
$W_2 = [w_2\ w_3]^T$, we can rewrite this equation in the following way:

$$V'W = [V_1'\ V_2'] \begin{bmatrix} W_1 \\ W_2 \end{bmatrix} = V_1' W_1 + V_2' W_2 = 0.$$

If $v_2'$ and $v_3'$ are linearly independent, then we can choose a form for W with $W_1 = 1$. Then we
must have $W_2 = -(V_2')^{-1} V_1'$. If $W_2$ contains a noninteger, then x = W clearly satisfies the criteria of
Theorem 3, and a crossing must exist. Otherwise, let $r = \max\{|w_2|, |w_3|\}$, i.e., r is the maximum
magnitude of the elements of $W_2$. If r > 1, then x = W/r satisfies the criteria of Theorem 3 and a
crossing must exist. The only remaining choices for $W_2$ are the 2×1 matrices over {-1, 0, +1}; these are

$$[0\ 0]^T, \quad \pm[0\ 1]^T, \quad \pm[1\ 0]^T, \quad \pm[1\ 1]^T, \quad \text{and} \quad \pm[1\ {-1}]^T.$$

We will now examine each of these possibilities on a case-by-case basis.

$$W_2 = -(V_2')^{-1} V_1' = -\begin{bmatrix} 1+u_1 & u_1 \\ u_2 & u_2 \end{bmatrix}^{-1} \begin{bmatrix} u_1 \\ 1+u_2 \end{bmatrix},$$

$$w_2 = \frac{u_1}{u_2}, \quad w_3 = \frac{-(1+u_1+u_2)}{u_2}.$$

Solving for u,

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \dfrac{-w_2}{1+w_2+w_3} \\[2ex] \dfrac{-1}{1+w_2+w_3} \end{bmatrix}.$$

    W_2 = [0 0]^T:     u = [0 -1]^T
    W_2 = [0 1]^T:     u = [0 -1/2]^T
    W_2 = [0 -1]^T:    u is undefined for this choice
    W_2 = [1 0]^T:     u = [-1/2 -1/2]^T
    W_2 = [-1 0]^T:    u is undefined for this choice
    W_2 = [1 1]^T:     u = [-1/3 -1/3]^T
    W_2 = [-1 -1]^T:   u = [-1 1]^T
    W_2 = [1 -1]^T:    u = [-1 -1]^T
    W_2 = [-1 1]^T:    u = [1 -1]^T

Thus, there are seven linear equivalence classes of crossing-free matrix multipliers with $v_2'$ and $v_3'$
linearly independent.

We can apply the same argument as above when $v_1'$ and $v_3'$ are linearly independent simply by
permuting the columns of V' (first and second are interchanged). Proceeding in this manner, we obtain
for the same possible choices of $W_2$

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \dfrac{-1}{1+w_2+w_3} \\[2ex] \dfrac{-w_2}{1+w_2+w_3} \end{bmatrix}.$$

Therefore, the set of vectors we obtain is the same as the set above except that the first and second
components are interchanged. This actually yields only two new vectors, $[-1\ 0]^T$ and $[-1/2\ 0]^T$.

We have considered all possible cases except that in which $v_2'$ and $v_3'$ are linearly dependent and
$v_1'$ and $v_3'$ are linearly dependent. Since V' must have full rank, this can be true only if $v_3' = 0 = v_3$.
Indeed, this corresponds to the canonical network. Therefore, all together, there are ten linear
equivalence classes of crossing-free systolic matrix multipliers.
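The case analysis above is small enough to enumerate by machine. The following sketch is illustrative: the function names are assumptions, and it simply evaluates the two closed-form expressions for u over all choices of $W_2$ with entries in {-1, 0, +1}, skipping the choices for which $1 + w_2 + w_3 = 0$, and then adds the canonical case u = 0.

```python
from fractions import Fraction as F
from itertools import product


def offset_from_w(w2, w3, swapped=False):
    """Return u for a given W_2 = [w2, w3]^T, or None when u is undefined.

    With swapped=False this is the case of v_2' and v_3' linearly independent;
    with swapped=True the first two columns of V' are interchanged, which
    interchanges the two components of u.
    """
    denom = 1 + w2 + w3
    if denom == 0:
        return None
    u = (F(-w2, denom), F(-1, denom))
    return (u[1], u[0]) if swapped else u


def crossing_free_offsets():
    """Collect the distinct offsets u that give crossing-free matrix multipliers."""
    offsets = {(F(0), F(0))}  # the canonical network itself
    for w2, w3 in product((-1, 0, 1), repeat=2):
        for swapped in (False, True):
            u = offset_from_w(w2, w3, swapped)
            if u is not None:
                offsets.add(u)
    return offsets


offsets = crossing_free_offsets()
```

The resulting set has ten elements: the seven offsets of the first case, the two additional vectors $[-1\ 0]^T$ and $[-1/2\ 0]^T$ from the second case, and the zero offset of the canonical network.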

3.5.4.  TWO  DIMENSIONAL  NETWORKS  WITH  FOUR  OR  MORE  DATA  FLOWS 

Theorem 3 can also be used to show that certain broad classes of two-dimensional systolic
networks must have crossings. The following theorem summarizes this interesting result.


Theorem  4: 


A two-dimensional systolic network (restricted as per Theorem 3) with four or more
pairwise linearly independent data flow velocities must exhibit crossings.


Proof: Let us assume that $v_1$, $v_2$, $v_3$, and $v_4$ are pairwise linearly independent. Since the crossings found by considering these four data flows alone are a subset of all the crossings in the network, we can, without loss of generality, consider the case of exactly four data flows:

$$V = [\,v_1 \;\; v_2 \;\; v_3 \;\; v_4\,] = \begin{bmatrix} v_{11} & v_{12} & v_{13} & v_{14} \\ v_{21} & v_{22} & v_{23} & v_{24} \end{bmatrix}.$$


We know from Theorem 3 that if a four-component vector, $x$, in the null space of $V$ with exactly one or two noninteger components can be found, then crossings must exist in the network (since we already know that the columns of $V$ are nonzero and nonparallel).


The null space of $V$ is a two-dimensional subspace of $\mathbb{R}^4$ and, as such, can be characterized as the range space of a $4 \times 2$ matrix, $W$, where $VW = 0$. We will rewrite this equation in the following way:

$$VW = [\,V_1 \;\; V_2\,] \begin{bmatrix} W_1 \\ W_2 \end{bmatrix} = V_1 W_1 + V_2 W_2 = 0,$$

where $V_1 = [\,v_1 \;\; v_2\,]$ and $V_2 = [\,v_3 \;\; v_4\,]$.

As before, we can let $W_1 = I$ and $W_2 = -V_2^{-1} V_1$. Since the columns of $V$ are pairwise linearly independent, $V_1$ and $V_2$ have full rank. Therefore, $W_2$ exists and has full rank. Thus, all vectors in the null space of $V$ have the form $x = Wy$, where $y \in \mathbb{R}^2$.
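
To make the construction concrete, the sketch below builds $W$ from $W_1 = I$ and $W_2 = -V_2^{-1} V_1$ for one example velocity matrix and confirms that $VW = 0$. The matrix `V` is a hypothetical choice with pairwise linearly independent columns, not one taken from the thesis, and the helpers `matmul` and `inv2` are illustrative.

```python
from fractions import Fraction


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inv2(M):
    """Invert a 2x2 matrix of Fractions; raise if it is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical example: columns (1,1), (1,-1), (1,0), (0,1) are pairwise independent.
V = [[Fraction(x) for x in row] for row in ([1, 1, 1, 0], [1, -1, 0, 1])]

V1 = [row[:2] for row in V]  # [v1 v2]
V2 = [row[2:] for row in V]  # [v3 v4]

# W2 = -V2^{-1} V1, and W stacks the 2x2 identity on top of W2.
W2 = [[-x for x in row] for row in matmul(inv2(V2), V1)]
W = [[1, 0], [0, 1]] + W2

print(W2)            # entries of W2 for this example
print(matmul(V, W))  # every entry should be zero
```

For this particular `V`, $W_2$ has all entries in $\{-1, +1\}$, the class of matrices considered in the remainder of the proof.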

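Arguing as we did in enumerating crossing-free matrix multipliers, we restrict the choices for $W_2$ to those $2 \times 2$ matrices over $\{-1, 0, +1\}$. We can eliminate those containing 0 since the corresponding column of $W$ would contradict the pairwise linear independence hypothesis. The only $2 \times 2$ matrices over $\{-1, +1\}$ not eliminated by the fact that $W_2$ has full rank are

$$\pm\begin{bmatrix} -1 & -1 \\ -1 & +1 \end{bmatrix}, \quad \pm\begin{bmatrix} -1 & -1 \\ +1 & -1 \end{bmatrix}, \quad \pm\begin{bmatrix} -1 & +1 \\ -1 & -1 \end{bmatrix}, \quad \text{and} \quad \pm\begin{bmatrix} -1 & +1 \\ +1 & +1 \end{bmatrix}.$$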

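In each of these cases, $x = W[1/2 \;\; 1/2]^T$ satisfies the conditions of Theorem 3: its first two components equal $1/2$, while its last two components are half the row sums of $W_2$ and hence integers. Thus, in all possible cases, some $x$ in the null space of $V$ can be found with exactly one or two noninteger components, and a crossing must exist. □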
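
The final step of the proof is a finite case check, so it can be verified exhaustively. The sketch below enumerates every $2 \times 2$ matrix over $\{-1, +1\}$, keeps those of full rank, and counts the noninteger components of $x = W[1/2 \;\; 1/2]^T$ with $W$ formed by stacking $I$ on $W_2$. It tests only the condition quoted in the proof (exactly one or two noninteger components); the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

# All 2x2 matrices over {-1, +1} with nonzero determinant.
full_rank = [
    ((a, b), (c, d))
    for a, b, c, d in product((-1, 1), repeat=4)
    if a * d - b * c != 0
]


def witness(W2):
    """Return x = W y, where W stacks the identity on W2 and y = [1/2, 1/2]^T."""
    rows = ((1, 0), (0, 1)) + W2
    return [r[0] * HALF + r[1] * HALF for r in rows]


def noninteger_count(x):
    """Count the components of x that are not integers."""
    return sum(1 for xi in x if xi.denominator != 1)


for W2 in full_rank:
    x = witness(W2)
    print(W2, [str(xi) for xi in x], noninteger_count(x))
```

Every full-rank candidate yields a witness whose only noninteger components are the two leading entries equal to $1/2$.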

It is important to note, however, that if we remove the restriction that all nonboundary PEs must have communication edges corresponding to every data flow, we can construct crossing-free systolic networks with four pairwise linearly independent data flow velocities (Figure 33).




4.  CONCLUSION 


In this thesis, we have abstracted the parameters that we feel are important in specifying a systolic design. We have given a simple set of rules for transforming these parameters while preserving the underlying computation. In addition, we have shown how to derive a description of the processors in the resulting design through a state flow analysis. Finally, we have characterized those transformations that avoid the phenomenon of crossing.

We feel that these transformations may be useful to the designer implementing a systolic array as a means of tailoring the array to a specific application, e.g., to avoid preloading data registers (make all data flows have nonzero velocity), to ensure that processors need not change state (make the state flow have zero velocity), to make the array more compatible with solving partitioned problems of larger size (guarantee that no subset of the data flows forms a cycle), or to make the array pipelinable (add an extra component to all parameters and add a velocity which is nonzero in this component). Furthermore, these transformations may prevent researchers from "reinventing" equivalent systolic arrays in an ad hoc manner. Perhaps the most interesting topic for further research would be a characterization of those problems or algorithms which are amenable to systolic processing. This elusive goal would further unify the theory of systolic arrays.
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